
Data Source and Background Information

The US Environmental Protection Agency maintains several public databases on environmental data.1 RadNet2 is a system of geographically distributed monitoring stations which sample and test for a number of analytes (e.g., gross beta (β), cesium-137, iodine-131) in the nation’s air, precipitation, and drinking water. RadNet provides historical data for estimating long-term trends in environmental radiation levels and serves as a means to gauge current levels of radioactivity in the environment. Stations are located across the US as well as in American territories. The database primarily consists of data collected since 1978, though some data dates back to 1973 from RadNet’s precursor, ERAMS (Environmental Radiation Ambient Monitoring System).

All data for this study was downloaded from the RadNet website in September 2017 as .csv files. According to the EPA’s Envirofacts Data Service API, data output is limited to 10,000 rows at a time from a maximum of three tables. Consequently, the data for each material was downloaded by decade (up to 1989, 1990-1999, 2000-2009, 2010-2017) and the resulting .csv files were loaded into RStudio and then merged3 into a single R dataframe.
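
The decade-by-decade download and merge can be sketched as follows. The merge itself was done in R; this is a minimal Python illustration in which the file names, columns, and values are invented stand-ins for the per-decade .csv exports.

```python
import csv
import io

# Hypothetical stand-ins for the per-decade RadNet downloads; the real files
# were .csv exports from the Envirofacts API (names and columns are invented).
decade_files = {
    "air_to_1989.csv": "LOCATION,RESULT\n1,0.01\n2,0.02\n",
    "air_1990_1999.csv": "LOCATION,RESULT\n3,0.03\n",
}

merged = []
for fname, text in decade_files.items():
    for row in csv.DictReader(io.StringIO(text)):
        row["SOURCE_FILE"] = fname  # track which download each row came from
        merged.append(row)
```

Carrying the source file name along in the merge is what makes it possible later to know which media dataframe each observation originated from.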

Data Overview and Tidying

The original raw consolidated dataframe has 19 variables and 504,092 observations. One variable was added in the merge operation to track the media dataframe from which each observation originated. From a first cursory glance, the variables of interest were monitoring station locations, sample types (materials) collected, and analytes. Dates and analysis amounts will be helpful in the investigation of these variables.

Variable names were shortened and several variables were recast as factors with ordered levels where an imposed order may be useful, e.g., S(econd) < M(inute) < H(our) < D(ay) < Y(ear). To allow for date manipulations, the date field was changed to type DATE instead of CHR.4
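
The two recasts can be illustrated in miniature. The report used R's ordered factors and Date type; here is an equivalent Python sketch (unit codes from the text, the example date is invented):

```python
from datetime import datetime

# Impose an order on the duration-unit codes so S < M < H < D < Y comparisons
# work; in R this was done with an ordered factor.
unit_rank = {"S": 0, "M": 1, "H": 2, "D": 3, "Y": 4}

def unit_lt(a, b):
    """True if unit code a represents a shorter time span than unit code b."""
    return unit_rank[a] < unit_rank[b]

# Recast a date from a character string (CHR) to a real date object (DATE)
result_date = datetime.strptime("1986-05-02", "%Y-%m-%d").date()
```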

Univariate Investigations

Univariate Plots, Tables and Summaries

Quick summaries of the variables show the most common values and components of the database. Air-filter samples dominate, as does the result unit pCi/m3, while the analytical sample size is split between liters (L) and cubic meters (m3).

Location Numbers, Cities, and States

The location numbers range from 1 to 4157, but there are only 324 unique sampling locations with most values <500; the distribution has a long right-skewed tail. The location numbers correspond to 289 cities or regions in the US, US territories, the Panama Canal (PC), or Ottawa, Canada (ON). Several cities have multiple monitoring stations, e.g., Oak Ridge, Tennessee has 11 different monitoring location numbers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    55.0   116.0   536.8   194.0  4157.0

The entries are not evenly distributed amongst the monitoring sites. Some sites have \(\geq\) 12,000 entries while others have only a single entry. Highly monitored states have \(\geq\) 20,000 entries while some areas have < 500 entries. The top and bottom five locations, by city and state, are tabulated by number of observations for that location or area.

Geographical Representations

A simple search turned up the R package ggmap5 which makes it possible to represent location information geographically.6 The packages choroplethr and choroplethrMaps also allow for geographical representation of various values on a shaded and keyed choropleth map.7,8 Using the provided city and state information, latitude and longitude variables were added for each location using the geocode() function.9,10
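
As noted in the Reflection, one lookup per unique city-state pair was performed rather than one per observation. A hedged Python sketch of that caching pattern (the real lookup used ggmap's geocode(); fake_geocode and its coordinates are invented stubs):

```python
# Cache one lat/long lookup per unique city-state pair instead of calling the
# geocoding service once per observation. geocode() in ggmap played this role
# in R; fake_geocode below is a stub with made-up coordinates.
observations = [("honolulu", "hawaii"), ("oak ridge", "tennessee"),
                ("honolulu", "hawaii")]

calls = 0

def fake_geocode(city, state):
    global calls
    calls += 1  # count how many "API" requests are actually made
    coords = {"honolulu": (21.3, -157.9), "oak ridge": (36.0, -84.3)}
    return coords[city]

latlong = {}
for city, state in observations:
    if (city, state) not in latlong:
        latlong[(city, state)] = fake_geocode(city, state)
```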

Most of the monitoring locations can be represented on a cropped map of North America. The eastern seaboard has the most monitoring stations. Filtering the data by latitude/longitude, we find 12 monitoring locations not represented on this map.

Filtering again we can look at regional monitoring stations such as those in Hawaii and Alaska on different types of state maps.

Material ID -> Sample Types

A bar graph yields a good visual of the distribution of sample material. As noted above, air samples dominate with nearly \(\frac{1}{2}\) of the entries, almost a quarter million observations.

Sample IDs and Analytes

Just over \(\frac{1}{2}\) of the sample ID numbers are distinct and 84% of these are single entries. The remaining sample IDs have as many as 25 entries. This should not be surprising as there are 61 different analytes for which each sample might be analysed, though Gross Beta accounts for nearly \(\frac{1}{2}\) of the observations. The top 10 most analysed analytes account for ~85% of the dataframe observations.

Sample Sizes and Units of Measure

A distribution of sample size values reveals groupings at 1, 5, and 5,000; however, this is not a valid comparison of size, as the unit of measure is needed for context. For example, a 1 L sample is 1,000 times the size of a 1 mL sample. Units can also indicate sample type: about \(\frac{1}{2}\) of the measured units are \(m^3\), generally used for gases; most of the other half are liquids (mL and L); a very small fraction are solids (mg and g); and a few entries have no designated unit.
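
Before comparing magnitudes, sizes would need to be normalized to a common base unit. A minimal sketch, assuming the unit codes as they appear in the database (conversion function name is invented):

```python
# Normalize liquid sample sizes to milliliters before comparing magnitudes
# (conversion factors are standard; unit codes match the database's ML/L).
TO_ML = {"ML": 1.0, "L": 1000.0}

def size_in_ml(amount, unit):
    return amount * TO_ML[unit]
```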

This ties well to the earlier observation that the predominant sample type is air-filter. The next biggest group, liter (L), matches well with the remaining sample types being liquids, i.e., precipitation, milk, and water (drinking & surface).

## 
##             G      L     M3     MG     ML 
##   3206     15 231564 250729    434  18144

Analytical Procedures and Duration

Procedures 1 and 9 account for 74% of the entries though there are 35 analytical procedure numbers. Procedure 1 is the mode11 of this variable with ~42% of the entries. A small high-density grouping around 120 shows procedures 118 and 119 accounting for another 19% of the entries. Thus, 93% of the entries are analyzed with one of four procedures. A duration variable indicates that while most tests take under 20 hours, there are some tests/procedures which run over 80 hours.

Result Amount, MDC, CSU, and Units of Measure

Result amounts, like sample sizes, cannot be directly compared as they are more or less meaningless without a result unit. However, a plot of the raw result values shows the greatest density around 0.01. Overlaying the minimum detectable concentration (MDC) and combined standard uncertainty (CSU) shows that both have their greatest density at <0.001, a factor of 10 lower than the results. Though this would need confirmation of the unit of measurement, the distribution is reassuring as the uncertainties and detection limits should be much less than the actual measurement.

Result units themselves are distributed similarly to the sample units, with about \(\frac{1}{2}\) of the measured units for gases (pCi/m3 and aCi/m3).

## 
## ACI/M3 DPM/GC    G/L  PCI/L PCI/M3 
##  22966     63  10585 242708 227770

Interestingly, 16% (81,921) of the observations have no result entry (result amount is NA). This seems odd as RadNet is designed specifically for the documentation of analyte concentrations, i.e., results. New variables were derived from the full and NA result entries, numerically representing complete as 1 and empty as 0, to allow for easy proportional and other analyses.
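
Deriving the 1/0 indicator is straightforward; this Python sketch uses None in place of R's NA, with invented example values:

```python
# Derive a numeric indicator from the result column: complete entries -> 1,
# empty (NA) entries -> 0. None stands in for R's NA; values are made up.
results = [0.01, None, 1.48, None, 0.02]

complete = [0 if r is None else 1 for r in results]
pct_empty = 100 * complete.count(0) / len(complete)
```

With the indicator in place, proportions of empty entries by any grouping variable reduce to simple means of 0/1 values.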

summary(rad_data_raw$RESULT_AMOUNT)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##   -200.00      0.01      0.01    134.93      1.48 257000.00     81921

Result Dates

A simple distribution of entries over result dates yields typical daily count ranges of 125 to 250 entries, but regular spikes of 50% to 100% above the baseline are noted. These spikes occur at regular intervals; until 2009 the spikes are at mid-year (7/1), after which they appear at year-end (12/31). Grouping the observations by a 2-year date range gives a bar plot in which you can see a drop in entries around 1989 and again 10 years later, though there is a subtle increase from 1999 to 2012. The drop in 2017 cannot be verified as the data was only downloaded through September.
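
The 2-year grouping amounts to binning each result date by year, counted from the start of the data. A minimal sketch (the 1978 origin is assumed from the text; the function name is invented):

```python
from datetime import date

# Bin result dates into 2-year ranges, counted from the start of the data
# (1978 origin assumed). Returns the first year of the bin.
def two_year_bin(d, origin=1978):
    return origin + 2 * ((d.year - origin) // 2)
```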

Analysis Types and Half-Lives

There are two types of analyses: radioactive (R), 97.9%, and elemental (E), 2.1%. At first it might seem odd that only 50% of the entries include a radioactive half-life (\(t_{1/2}\)). However, half-life is only defined for a single isotope of a particular element, and gross beta, which is not isotope specific, accounts for 50% of the entries; a quick filter confirms that the half-life entries for beta are NA’s.

## 
##      E      R 
##  10585 493507
##   ANALYTE_ID          HALF_LIFE     
##  Length:232387      Min.   : NA     
##  Class :character   1st Qu.: NA     
##  Mode  :character   Median : NA     
##                     Mean   :NaN     
##                     3rd Qu.: NA     
##                     Max.   : NA     
##                     NA's   :232387

Of the 61 analytes, 56 have defined half-lives, and the top 10 account for a majority of the remaining dataframe.

Half-life values by themselves are like sample size and result amount and cannot be compared directly. However, the values in this case can be easily converted to a single time unit, e.g., years, and then plotted. This wonderfully captures the breadth of half-lives involved within this group of isotopes. The shortest-lived isotope, radon-219, has a \(t_{1/2}\) of 3.9 seconds compared to the longest-lived isotope, lanthanum-138, with \(t_{1/2}\) = \(1.05 \times 10^{11}\) years, 105 billion years.
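
The unit conversion can be sketched directly. This Python version assumes a 365.25-day year and the same S/M/H/D/Y unit codes used for durations; the half-life values are the two quoted above:

```python
# Convert half-lives reported in mixed units to years so they can be plotted
# on one axis (365.25-day year assumed).
SECONDS_PER_YEAR = 365.25 * 24 * 3600

TO_YEARS = {"S": 1 / SECONDS_PER_YEAR, "M": 60 / SECONDS_PER_YEAR,
            "H": 3600 / SECONDS_PER_YEAR, "D": 1 / 365.25, "Y": 1.0}

def half_life_years(value, unit):
    return value * TO_YEARS[unit]

rn219 = half_life_years(3.9, "S")      # radon-219: 3.9 s
la138 = half_life_years(1.05e11, "Y")  # lanthanum-138: 1.05e11 y
```

The two extremes differ by roughly 18 orders of magnitude, which is why a log scale is needed for the plot.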

Univariate Analysis and Review

Dataset structure:

The original dataframe had 504,092 entries with 19 variables. The variables represent factors (some levelled), numerical values, characters, and dates. Some of the variables were recast for uniformity and to help data manipulation and representation. Most notably, the dates were converted from character strings in ISO 8601 format to the DATE type.

Main features of interest:

  • Material ID: milk, air: filter and charcoal, water: surface and precipitation
  • Location: Geographic information over monitoring (sampling) locations
  • Result Amount
    • comparative amounts
    • empty result entries (NAs)

Support features in the dataset:

Dates should be very useful in investigating the time component to the information. Result values in combination with Location IDs and dates would help isolate groups of data and identify any radioactive release events or concerns. Analytes tested should also help to subset and classify data to provide better comparisons. Additionally, Sample IDs may help to find duplicate entries or to better understand missing data.

New variables created from existing variables:

Several new variables were created for handling the geographical information, e.g., latitude and longitude. The R packages ggmap and choroplethr and the geocode API all require specific formatting of input data, which in turn required reformatting of variables, e.g., NM to new mexico. Three variables were added to clarify and provide better handling of the empty result amounts (NA). One variable, complete, is a simple character variable with Y(es) and N(o) to identify whether there is a result value or not. Two numeric variables were added in the hope of better quantifying and classifying the Y=1 and N=0 values.

Dates were grouped into different time spans to provide a way for a more simplified examination of the dataframe as well as to investigate duplicate Sample IDs where entries for different analytes are made on different days.

Data Cleansing/Unusual Distributions:

In general, cleaning and data munging involved verifying and reformatting location data. Column names were tidied and shortened. A few database issues were found, e.g.: the abbreviation PC was not recognized by the Google API and was changed to “Panama Canal”; the city of Doswell, SC does not appear to exist, but several Doswell, VA observations were found in the dataframe, so the single Doswell, SC entry was changed to VA. Several entries were listed only as EPA Regions 1 to 10; in an effort to geo-locate the data, each region number was changed to the location of that EPA region’s headquarters.12
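
The spot fixes above can be expressed as a small cleaning function. This is an illustrative Python sketch (the rules come from the text; the function name and tuple format are invented):

```python
# Spot fixes applied to location fields before geocoding. Rules taken from
# the text; this stand-in covers two of them for illustration.
def clean_location(city, state):
    if (city, state) == ("DOSWELL", "SC"):  # city does not exist in SC
        return ("DOSWELL", "VA")
    if state == "PC":                       # abbreviation the API rejected
        return ("PANAMA CANAL", "PC")
    return (city, state)
```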

Because of the number of missing result entries a duplicate dataframe (rad_data) was created which filtered out the NA rows for RESULT_AMOUNT.

A couple other features of note: 1) There is a long-tailed positive skew for the distribution of sample replicates. While there is the possibility for multi-analyte analysis, most samples are primarily sampled for gross beta analysis. 2) The periodic spikes in entries occur on July 1 and December 31. Is this a side effect of the database, precipitated by some administrative specification, or were there really that many year-end entries, on New Year’s Eve?

Bivariate Investigations

Bivariate Plots

Correlations

A heat map correlation matrix indicates that most of the numerical variables are not well correlated. The new Y/N variables are negatively correlated with each other, and the measurement variables (result amount, CSU, and MDC) are strongly correlated.

A general correlation matrix13 on a 10% sample of the database reveals how the majority of the observations may be classified. Air-filter entries, as the largest material group, are easy to follow within the matrix: air-filter samples are usually analysed with procedure 1 for radioactive components, reported in pCi/m3, and are by-and-large completed entries. At the opposite end of the spectrum, there are so few elemental (E) entries that these are also easy to see in the matrix, and they group nicely with respect to the other variables. The elemental analyses are for pasteurized milk, predominantly by procedure 9, reported as g/L, and are all completed. A further filter reveals that the specific analytes are calcium (Ca) and potassium (K).

Variables with multiple levels, such as cities and states, are not easy to include in this type of matrix. To provide correlations for observations by location, some grouping would be necessary.

Monitoring Locations and Entries

When monitoring stations are grouped within a region, there is an unsurprisingly good correlation (r=0.87) with the number of observations. Essentially, with more monitoring stations, more entries would be expected. However, the correlation is weaker (r=0.79) when examining the number of empty results against the number of regional monitoring stations.
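
For reference, the Pearson r reported here is computable from first principles. A self-contained sketch in which the regional counts are made up to mimic the reported pattern (more stations, more entries):

```python
from math import sqrt

# Pearson correlation coefficient from scratch; the counts below are
# invented to mimic the reported stations-vs-entries relationship.
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

stations = [1, 2, 4, 8]           # monitoring stations per region (made up)
entries = [500, 900, 2100, 3900]  # database entries per region (made up)
r = pearson_r(stations, entries)
```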

Using choropleth maps we can visualize the observations entered by state and compare information geographically. It is easy to pick out individual states with a moderate/low number of observations and higher percentages of empty results, e.g., Kentucky with <4,500 entries has 18% empty, while California with 23,686 entries has 13% NA results.

However, a simple summary of the percent of empty observations shows the median percent of empty entries by region is 15.9%, the same as in the general database. Aside from a few outliers, empty entries are not a location issue.

Hawaiian Subset

Regional subsets of the data were useful to simplify initial data manipulations before working with the entire data set. Observations from Hawaii were filtered, and a simple state overlay by count shows the majority of samples are from Honolulu. Using a simple material ID facet, we see Honolulu has the most varied sample types, and air-filters are the most common material sampled.

Examination of the material types in Hawaii yields some interesting observations. Air-filter samples are the majority of samples and they are almost all complete. Air-charcoal samples have the most empty entries by percent, but the total number of empty entries is more than 5 times greater for precipitation samples. There are no surface water analyses entered for Hawaii.

In 2011 there is an anomalous spike of empty entries, with a corresponding spike of completed entries. Using a material-filled distribution, we see the spike is composed of air-charcoal and air-filter samples. As air-charcoal samples were phased out in the 1980s, it seems possible that the charcoal samples were entered in error and re-entered as air-filter samples.

Further examination of the Hawaiian air samples confirms that around April 2011, 378 air analyses were entered: one-half as air-charcoal (190) with 128 empty, and the remainder as air-filter analyses (188) with only 21 incomplete. However, upon a closer look, the air-filter sample spike is actually earlier than the air-charcoal spike.

So while interesting, more information is needed to establish the actual relationship between the empty air-charcoal and air-filter samples. A quick survey of the database shows that only a few locations have empty air-charcoal entries, with Alaska and Hawaii having the greatest number.

Empty Entries and Completed Samples All Regions

To examine the preponderance of empty entries in the entire database, a table of proportions was generated and frequency distributions of the complete/incomplete observations were plotted by date. Interestingly, empty (incomplete) entries only exist from 1990 to 2011. During this 20-year interval, the ratio of empty to completed entries is fairly steady at ~0.37.

Cross-examination of complete entries by analyte shows that of the 61 analytes only 15 have observations with an empty RESULT_AMOUNT. Cesium-137 tops the list with 17,938 empty entries. More interesting, though, is Thorium-234 with 163 empty entries but no completed entries.

Empty entries by material ID reveal that, proportionally, Air-Charcoal is the most incomplete while Air-Filter is the most complete. On an actual count basis, Precipitation and Pasteurized Milk have the most blank result values. Just two analytical procedure numbers have empty entries: procedures 9 and 118.

Sample IDs - Samples and Tests

The number of distinct sample IDs is 272,004, 54% of the database. If distinct is applied pair-wise with another category, such as location, the number of samples does not change, indicating sample IDs do not appear to be used across multiple locations. However, grouping by both sample ID and result date yields more distinct combinations, indicating that results for one sample ID may be entered on different dates. If grouped over a date range (e.g., 6 mos, 1 yr, 2 yrs) instead of exact dates, the number of these date-split duplicates decreases. This essentially limits the lifetime of a sample ID and indicates IDs are not recycled. The distinct paired sample ID-analyte ID count confirms that 98% of the data (494,249 rows) is unique within these two variables. Consequently, the reason for duplicated sample IDs is that multiple analytes are tested per sample.

## [1] "Distinct Sample IDs:  " "272004"
## [1] "Distinct Sample IDs by location:  "
## [2] "272004"
## [1] "Distinct Sample IDs by result date:  "
## [2] "298576"
## [1] "Distinct Sample IDs by Date Range (6 mos) and Location:  "
## [2] "279265"
## [1] "Distinct Sample IDs by Date Range (1 year) and Location:  "
## [2] "277163"
## [1] "Distinct Sample IDs by Date Range (2 years) and Location:  "
## [2] "274677"
## [1] "Distinct Sample IDs by Analyte:  " "494249"

Thus, to collect together all parts of a sample, the dataframe was grouped across 4 categories (material ID, date range (2 yr), location, and sample ID). This grouping was developed to determine a status for each sample: complete - all analytes have results; incomplete - some analytes have empty results; and empty - no results entered for any of the analytes. Using this status categorization there are 274,677 samples, 93.5% of which are complete, 6.4% are incomplete, and 0.1% are empty. With the grouped data we can visualize the status of a sample by the number of tests (entries). Most samples have only 1 test; those that are empty have about 5 tests, while incomplete samples have a range of entries (tests), mostly from 4 to 10.
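
The status classification can be sketched as a group-then-classify pass. The grouping here uses only sample ID for brevity (the real key also included material ID, 2-yr date range, and location), with invented entries and None standing in for NA:

```python
from collections import defaultdict

# Classify samples as C(omplete), I(ncomplete), or E(mpty) from their
# grouped entries. Sample IDs and values are invented; None stands in for NA.
entries = [("s1", 0.01), ("s1", 0.02),  # all results present  -> C
           ("s2", None), ("s2", 1.48),  # some results missing -> I
           ("s3", None), ("s3", None)]  # no results at all    -> E

groups = defaultdict(list)
for sample_id, result in entries:
    groups[sample_id].append(result)

def status(results):
    filled = sum(r is not None for r in results)
    if filled == len(results):
        return "C"
    return "E" if filled == 0 else "I"

statuses = {sid: status(rs) for sid, rs in groups.items()}
```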

## 
##      C      E      I 
## 256946    285  17446

Result Values by Sample Type

While it was noted that result values themselves cannot be compared directly without ensuring the same unit of measure, we can group like samples and plot values filtered by a unit. Filtering for the chief analyte beta and pCi/m3 within the Hawaiian dataset, a plot by date yields two peaks above background. The dates of these events are just after the nuclear accidents at the Chernobyl and Fukushima nuclear power plants in May 1986 and March 2011.

The entire data set was then evaluated in the same fashion, and the spikes for Fukushima and Chernobyl are somewhat changed: beta result values in other locations reverse the relative heights of the peaks. In addition, two new spikes emerge, one in March 1981 and another in July 2008.

Other analytes can be similarly examined. Radium-226, the 10th most common analyte, is most often reported in picocuries per liter (pCi/L). Filtering for this subset and plotting yields a much less populated graph with consistently low values until 2012, when the plot seems much noisier and negative values are reported.

However, a simple overlay of the detection limit and uncertainty changes the perspective. On a log scale we see both detection limits and uncertainties are above the average reported value, thus reducing the perceived significance of the peak point > 25 pCi/L.

Bivariate Analysis and Review

Relationships Observed in bivariate investigations

The material ID appears to be the most convenient way to quickly subset and classify the observations. As seen in the univariate analysis, Air-Filter dominates, and a correlation matrix confirmed that Air-Filter samples coincide with the top categorical results: result unit pCi/m3, analytical procedure 1, type Radioactive. Air-Filters also comprise the most complete entries, Y.

Location data, while informative and visually interesting, does not easily correlate to the other variables in the data. Yes, more observations are noted in areas with more monitoring stations. Not surprisingly, there are more monitoring stations located in areas of higher population and locations near known radioactive material handling, e.g., Oak Ridge, TN; Carlsbad, NM; Richland, WA. The incomplete entries, with a few exceptions, are generally, as in the database overall, about 16% of the entries by location.

The empty entries do have a strong correlation to material ID. Air-Charcoal entries are predominantly incomplete. Milk and precipitation samples have more incomplete entries percentage-wise than the remaining material types. The largest group of sample material, Air-Filter, has the least incomplete entries.

Interesting relationships between minor features

The Element category is a small subset of the data and is not generally split across other categories. Most succinctly, Element entries consist of completed drinking water samples reported in g/L. The Element entries are split by analytical procedure, with most analysed by procedure 9 and a few by procedure 57. On this limited set it was easy to tie the procedures to the analytes potassium (K) and calcium (Ca), respectively.

One somewhat surprising result is the weak correlation (0.42) of analytical sample size to the result amount value. Sample sizes are often specified by procedure, and it should not be expected that the result values would be related to the size of the sample. However, it is important to remember that the results and sizes are not standardized for actual comparison, so this correlation is purely a relationship between the raw values, not any actual magnitude. In this case, I would theorize that the relationship is forced by the limited set of analyses in which the result values and sample sizes possess low variance.

The strongest relationship

The strongest correlations in the original dataframe are between the result amount, minimal detectable concentration, and combined standard uncertainty values. This is to be anticipated as both MDC and CSU are determined/calculated through the analyses of samples and results. Again, caution should be used here as the values are not normalized to actual magnitude without the measurement unit.

The perfect negative (-1) correlation between the Y/N status variables merely confirms that these variables were set up correctly. If the entry is complete (1) it cannot also be incomplete (0) and vice versa.

Multivariate Investigations

Multivariate Plots Section

Simple Facets and Fills

With simple facets or variable fills of earlier plots, more relational information between the variables can be elucidated. When the entry count histogram is faceted by material ID, air-filter entries dominate as expected, and the analyte ID fill shows gross beta as the predominant analyte. Though it is difficult to pick out the exact analytes, we can tell that gross beta is a much smaller percentage of the other materials. Surface water and air-charcoal have the fewest entries.

The distribution of samples over the various monitoring stations is positively skewed, just like the skew of the monitoring station location IDs. However, if sample counts are displayed by region they look at least more evenly distributed, because the regions are not organized in the same fashion and do not have as many unused identifiers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    55.0   116.0   536.8   194.0  4157.0

When sample counts are plotted by year, we see that pasteurized milk and precipitation were analysed more often in the 1980s-1990s, but appear to have been discontinued/deprioritized whereas the number of air-filter samples has picked up in the last 20 years. Surface water analysis seems to have been terminated in 2000.

Focusing only on the next five “second tier” analytes, there appears to be a fairly regular distribution of these analyte entries across the various monitoring sites.

Result Amounts and Locations - Beta Analyses

Using beta analysis results we can facet by location to get a different picture of the release events noted earlier. The Hawaiian data set shows that the Chernobyl event was picked up at the Honolulu station(s), which appear to be the only stations monitoring at that time, but the Fukushima release was picked up as gross beta spikes at both Kahuku and Kauai.

Looking at the entire dataset in the same manner, we can conclude that the July 2008 event appears localized to Tennessee, while the March 1981 incident was picked up at several monitoring stations.

No information could be found on nuclear incidents during 2008. The only nuclear accident found during Feb/March 1981 was at the Tsuruga Power Plant14. The RadNet data was entered March 3 by a Nevada station; however, reports place this incident later, on March 8th15 or 9th16. Still, as reports were covered up for some 40 days and earlier issues at the power plant went unreported, it is possible these spikes are related to the Tsuruga event.

Data was filtered and grouped for the month after the Fukushima incident (March 11 to April 11, 2011). The summarized data shows that maximum beta measurements across the US are definitely higher on the west coast during this month versus measurements made in the eastern continental US. The same data from the previous year was summarized to provide a baseline comparison. The 2010 values are less than a tenth of the 2011 values. Additionally, the pattern across the US is different, with maximum values being higher across the northern border instead of the coast.

More localized results can be seen by plotting the maximum beta results by monitoring station. Higher values are notable in Las Vegas, Boise, southern California, Arizona, and oddly Montgomery, Alabama.

Not many other radionuclides were analysed during this time, but a plot of the other analytes (with the same measurement unit, pCi/m3) during this event shows some areas with peaks up to 37 times that of the gross beta, e.g., 37 pCi/m3.

Measurements in Alaska after Fukushima were almost twice the contiguous US response, 1.15 pCi/m3 vs. 0.6 pCi/m3 gross beta max. Other radionuclides were detected at even higher values in AK, 144 pCi/m3 for Bismuth-212, 126 pCi/m3 for Lead-212 and 59 pCi/m3 for Thallium-208.

Sample Status/Incomplete Empties

Using a simple material ID facet shows that the empty observations are mostly precipitation/milk samples with very few air-filter samples.

However, when we group by sample ID, only 285 samples (0.1%) of the 272,004 distinct samples are truly empty and 6% are incomplete. When faceted by status, the empty samples are difficult to see.

## 
##      C      E      I 
## 256946    285  17446

A deep-dive view of the empty samples reveals that they are predominantly drinking water samples entered between 1998 and 2008. Thus, the empty entries seen for the other materials must be part of a larger sample that is now ranked as incomplete.

To show the status of samples, the proportion of completeness (incompleteness) is faceted by material. The pComplete is complementary to the pIncomplete, and consequently the boxplots are a group of inverted mirror images. However, we can easily see that air-filter, drinking water, and surface water samples are largely complete with some sample outliers. The dispersion is greater for the other material categories. Both air-charcoal and pasteurized milk completed samples are left-skewed for completeness, with the median near the third quartile. The distribution for precipitation is the most normal, with the median centered between the 2nd and 3rd quartiles.

A quick look at the geographical information for drinking water samples shows that the empty and incomplete samples are spread across the various monitoring stations. No one region appears to have an issue.

Multivariate Analysis and Review

Examining the relationships between multiple variables confirmed deductions made via the cursory univariate review. The majority of observations between material ID (air-filter), analyte (beta), and unit (pCi/m3) are in fact correlated. One caution is to be aware of the data and its setup. The distribution of samples at first appeared to have some geographic importance when examined by location number. However, remembering that the univariate analysis showed an extremely right-skewed assignment of location numbers led to another look. And while some locations do have more observations than others, the distribution is more even when looking at actual geographic locations or regions.

Interesting and Surprising Feature Interactions

Entries without result values have a more interesting story and are more difficult to track than first thought. At first glance these entries are a fairly significant portion of the database (16%). However, grouping them as samples (by sample ID and date) reveals that most empty observations are just multiple analyte tests on the same sample. So for the most part each sample has some information. With the exception of a few locations, such as the air-charcoal samples in Hawaii, status and location are not correlated.


Final Plots and Summary

Plot One

RadNet Overview

This North American map provides a nice synopsis of the RadNet Network from 1978 to 2017. Cities with monitoring stations are marked with a colored dot indicating the number of stations in that city. Additionally, the size of the dot conveys the total number of entries into the database by that city. While this map leaves out a number of monitoring stations in the Pacific Ocean it gives a nice simple overview as to the expanse and impact of the network.

Plot Two

Periodic Entry Spikes and Empty Result Entries

These empty and completed entry distributions yield a slightly different perspective on the daily entries. In the univariate analysis we saw periodic spikes occurring with regularity on July 1 and December 31. The data was filtered for just these dates and then plotted as a barplot. In addition, lines for the total number of entries each month and the average number of entries per day for that month were added. In general, we see what is expected: the spikes are much greater than an average daily entry count and less than the month’s total. The facet grid by empty/complete is a good imitation of that seen earlier. However, there is a new surprising feature: when distinguished by material ID, the samples entered in the July 1 and December 31 groups are revealed as predominantly drinking water samples. While the distribution is of the same shape, the “background” samples covered up the fact that the periodic events are limited to the drinking water material type.
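
Isolating the spike dates is a simple month/day filter. A Python sketch with invented dates (the July 1 / December 31 pair comes from the text):

```python
from datetime import date

# Pull out just the periodic spike dates (July 1 and December 31) from a
# list of result dates; the example dates are invented.
dates = [date(2005, 7, 1), date(2005, 7, 2), date(2010, 12, 31),
         date(2011, 3, 15)]

spike_days = {(7, 1), (12, 31)}
spikes = [d for d in dates if (d.month, d.day) in spike_days]
```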

Plot Three

Sample Status Faceted by Material Type

These double-faceted bar-plots are a clean, elegant way to clarify the status of samples within the database and provide easy classification possibilities. Samples are identified as complete (all entries with the same sample ID have results), empty (all entries for the sample ID are NA), or incomplete (a mix of result values and NAs). The separation by material type allows easy comparison between sample types. Finally, the violin plot readily illustrates the density of the number of analyses (entries) per sample.
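The complete/empty/incomplete classification can be sketched with a grouped `case_when()`; again the column names are illustrative stand-ins for the RadNet fields:

```r
library(dplyr)

# Mock entries: sample 1 is complete, 2 is empty, 3 is incomplete.
entries <- data.frame(
  samp_id = c(1, 1, 2, 2, 3, 3),
  result  = c(4.2, 1.1, NA, NA, NA, 0.7)
)

# Classify each sample by the pattern of NAs across its entries.
status <- entries %>%
  group_by(samp_id) %>%
  summarise(n_analyses = n(),
            status = case_when(
              all(is.na(result))  ~ "empty",
              all(!is.na(result)) ~ "complete",
              TRUE                ~ "incomplete"),
            .groups = "drop")
```

The `n_analyses` column is what feeds the per-sample density shown by the violin plot.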


Reflection

Overall it was amazing to discover how many avenues for exploration could be found; the difficulty was avoiding being drawn ever deeper into each examination. The number of ways to approach a solution is surprising, and as daunting as it is comforting. Other struggles ranged from simple wrangling and tidying of the data, to determining the proper commands and functions for each situation, to re-wrangling the data into a format suitable for a given function.

Some simple data changes required judgment calls. For example, the city of Doswell, SC as entered in the database does not appear to exist, but several Doswell, VA observations were found in the dataframe, so the single entry was changed from SC to VA. All entries listed only as EPA Regions (1-10) were changed to that region's headquarters17 in an effort to geo-locate the data. A small dataframe of unique city-state combinations was created rather than inundating the geocode API with multiple requests for the same location; this effort was partly responsible for the discovery of multiple monitoring stations in a single city. Many issues were in fact discovered through the analysis of other parts of the data.
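The deduplicated lookup table can be sketched as below; the geocoding step itself is commented out since `ggmap::geocode()` requires an API key, and the column names are illustrative:

```r
library(dplyr)

# Mock observations with repeated city/state pairs.
obs <- data.frame(
  city  = c("Doswell", "Doswell", "Montgomery"),
  state = c("VA", "VA", "AL")
)

# One row per unique location, so each city-state pair is
# geocoded only once instead of once per observation.
locations <- obs %>% distinct(city, state)

# Sketch of the real workflow (needs a configured ggmap key):
# locations <- locations %>%
#   bind_cols(ggmap::geocode(paste(locations$city, locations$state, sep = ", ")))
# obs <- left_join(obs, locations, by = c("city", "state"))
```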

All variables were analyzed in the univariate section, which was overwhelming and may have led to some disjointed yet interesting analyses. It also produced a myriad of analyses that needed to be culled and shortened into a more cohesive presentation.

Some of the biggest struggles were in finding a decent way to present the geographic data. Beyond reconfiguring variables to satisfy command and argument requirements, the data sometimes had to be refined to where it was most meaningful or expressive. For example, as powerful as the visual is, a choropleth map can only express one value per region; one must subset the data to fit the region and decide which datum has the most impact: median, max, or min? I find these graphs to have broader general appeal and to be easier to interpret once they are set up.

One of the most frustrating issues was tidying up and generating the report. knitr raised issues that were unrelated to its error messages. One graph was plotted with three different approaches, all of which worked fine in the RStudio console; only after the extraneous columns were removed from the dataframe used in the ggplot() call did the issue disappear in knitr. RStudio also seems to have a display/timing issue: if a chunk performs a lot of data manipulation before rendering a graph, the display will sometimes try to render the graph before the data is ready, leaving a blank screen in the console. Re-running just the plotting code then works.

As the intent of the RadNet database is to provide information on radioactive exposure and background levels, it was surprising to find the missing data as interesting as some of the results. Since I do not feel I investigated the results themselves in depth, many avenues for exploration remain. One could examine correlations between analytes, or differences between the analytes and any release events. A closer inspection of detection limits and uncertainties with respect to analytes and results would also be interesting. A few regions were ignored and perhaps would show different impacts from the different release events, and an in-depth look at a small region with respect to all the analytes might reveal observations of regional interest.


  1. https://www.epa.gov/enviro/envirofacts-overview

  2. https://www.epa.gov/enviro/radnet-overview

  3. https://psychwire.wordpress.com/2011/06/03/merge-all-files-in-a-directory-using-r-into-a-single-dataframe/

  4. https://www.rdocumentation.org/packages/base/versions/3.4.1/topics/as.Date

  5. http://cran.r-project.org/web/packages/ggmap/ggmap.pdf

  6. https://blog.dominodatalab.com/geographic-visualization-with-rs-ggmaps/

  7. https://en.wikipedia.org/wiki/Choropleth_map

  8. https://www.rdocumentation.org/packages/choroplethr/versions/3.6.1

  9. https://www.rdocumentation.org/packages/ggmap/versions/2.6.1/topics/geocode

  10. http://www.storybench.org/geocode-csv-addresses-r/

  11. Getmode function https://www.tutorialspoint.com/r/r_mean_median_mode.htm

  12. https://www.epa.gov/aboutepa#pane-4

  13. ggpairs

  14. https://en.wikipedia.org/wiki/Tsuruga_Nuclear_Power_Plant

  15. http://timshorrock.com/wp-content/uploads/Chronology-of-1981-Tsuruga-Accident-from-Japanese-Press.pdf

  16. https://www.history.com/this-day-in-history/japanese-power-plant-leaks-radioactive-waste

  17. https://www.epa.gov/aboutepa#pane-4